PERF: DataFrame.values for pyarrow-backed numeric types #52348

lukemanley · 2023-04-01T16:21:02Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Performance improvement for DataFrame.values when the frame is backed by a single pyarrow numeric dtype and contains no nulls. I realize this is a narrow use case, so I'm happy to close this PR if it isn't worth special-casing. The current slowness comes from DataFrame.values always casting to object dtype for EA-backed frames. Unfortunately, a single null anywhere in the DataFrame still forces the object path and misses this optimization, since pd.NA is used as the null representation in the ndarray.

import pandas as pd
import numpy as np

data = np.random.randn(100_000, 20)
df = pd.DataFrame(data, dtype="float64[pyarrow]")

%timeit df.values

# 98.7 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- main
# 3.56 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

phofl · 2023-04-02T15:18:43Z

This also changes behavior (e.g. getting float instead of object). Personally, I think this is fine, but we have an unresolved discussion about this somewhere. We should decide there first before special-casing here, I'd say.

lukemanley · 2023-04-02T15:50:15Z

> This also changes behavior (e.g. getting float instead of object).

Yes, the performance improvement is due to avoiding the cast to object. Note, this behavior actually already exists on main for a DataFrame with a single column:

Behavior on main:

import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, dtype="float64[pyarrow]")
df2 = pd.DataFrame({"a": [1.0, 2.0, pd.NA]}, dtype="float64[pyarrow]")

print(df1.values.dtype)  # float64
print(df2.values.dtype)  # object

> Personally, I think this is fine but we have an unresolved discussion about this somewhere. We should decide there first before special casing here I'd say

Sure, I think you might be referring to #22791.

lukemanley · 2023-04-09T23:38:12Z

Closing for now, pending further discussion in #22791.

PERF: DataFrame.values for pyarrow-backed numeric types

2ae7271

lukemanley added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Apr 1, 2023

lukemanley closed this Apr 9, 2023

lukemanley deleted the perf-df-values-arrow branch April 18, 2023 11:03
